Low-level optimization tips
Now that you've chosen a fast, low-complexity algorithm, it's time to optimize the #*($& out of it. Read on...
The 90/10 rule
Never, ever forget that 90% of your CPU time will be spent in 10% of your code.
Needless to say, you should never optimize until you know what is taking the most CPU power in your filter. Fortunately, this rule is more like 99/1 for video and you will typically get most of your gain by optimizing the inner row loop. Do this first, and continue to optimize only if significant time is being taken in outer loops.
Floating point is bad
Floating point is too slow for VirtualDub filters.
It is true that, with the advent of the Pentium, floating point is much faster than it used to be, but the truth is that there really isn't a modern x86 CPU in use that won't execute a fixed-point implementation faster than a floating-point implementation, for the following reasons:

* Integer and MMX instructions can process two to eight channels per operation, while x87 floating point handles one value at a time.
* Pixel data begins and ends as integers, so floating point adds int-to-float and float-to-int conversions on every load and store -- and float-to-int conversion in C is particularly slow, since the compiler must switch the FPU rounding mode around each one.
* Floating-point operations have longer latencies, which makes it harder to keep a tight inner loop full.
There are circumstances where the precision and range of floating-point numbers is necessary for an image algorithm, in which case a fixed-point implementation would be poor or impossible. In most cases, however, 16 or 32 bits per channel will suffice with proper rounding and give much better speed. Basically all known fast algorithms for the inverse discrete cosine transform (IDCT) are implemented in fixed point on Intel CPUs, and few VirtualDub filters are as complex as the IDCT.
You should be especially careful, though, when coding infinite impulse response (IIR) filters -- they are very sensitive to roundoff noise and may not work even with 32-bit fixed point. FIR filters tend to be more forgiving since they don't compound errors, but unless your coefficients are basic multiples of powers of two, you're better off doing 32-bit accumulation without intermediate rounding.
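To give an idea of how far plain integers go, here is a minimal sketch of a 1-2-1 FIR blur on one 8-bit channel; the function name and the choice to skip the edge pixels are illustrative, not SDK conventions:

/* 1-2-1 horizontal blur, all integer.  The coefficients sum to 4,
   so the divide is a shift, and the +2 rounds to nearest. */
void blur121(const unsigned char *src, unsigned char *dst, int w) {
    for(int x=1; x<w-1; x++)
        dst[x] = (unsigned char)((src[x-1] + 2*src[x] + src[x+1] + 2) >> 2);
}

Because the sums never leave 32-bit integers and are rounded exactly once at the end, there is no roundoff accumulation to worry about.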
SSE is bad
In general, 128-bit SSE instructions are also inappropriate for VirtualDub filters.
There are three major problems with the 128-bit Streaming SIMD Extensions (SSE) instructions introduced in the Pentium III:

* They require a Pentium III, as well as an operating system that saves the new 128-bit XMM register state.
* VirtualDub only guarantees 4-byte alignment on scanlines, so you cannot use the fast aligned 128-bit loads and stores and must settle for the much slower unaligned forms.
* The Pentium III executes each 128-bit operation internally as two 64-bit halves, so SSE throughput is at best half that of MMX or scalar code.
The first restriction kicks out 90% of the users, the second kills performance for loads and stores, and the third basically ensures that you will be able to execute 2-4 times more scalar or MMX instructions than SSE ones. You should thus probably avoid SSE for filters unless you have complex arithmetic or high range/precision requirements that force floating-point math. Otherwise, concentrate on MMX instead.
Note that SSE instructions which operate on scalar or MMX registers -- which I call the "integer SSE" instructions -- are supported on the Athlon and do not have the performance or data bottlenecks described above. Do use these if possible!
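As a concrete example, pavgb computes the rounded-up byte average used later in this document, eight channels per instruction. A sketch via the standard _mm_avg_pu8 intrinsic from xmmintrin.h (the wrapper name is mine, and a CPUID check for integer SSE support is assumed to happen elsewhere):

#include <xmmintrin.h>   /* integer-SSE intrinsics; these work on MMX registers */

/* per-byte (a+b+1)>>1 across eight bytes (two pixels) via pavgb */
static __m64 avg8(__m64 a, __m64 b) {
    return _mm_avg_pu8(a, b);
}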
Avoid division
Division is a very expensive operation -- it is the only member of the four basic operators that is not easily parallelized, and takes much longer to execute than addition or multiplication on just about any modern CPU. To give you some idea:
CPU / instruction set | Addition | Multiplication | Division
---|---|---|---
Scalar Pentium | 0.5 clock | 10 clocks | 46 clocks
MMX Pentium | 0.5 clock | 3 clocks (2 overlappable) | --
FPU Pentium | 3 clocks (2 overlappable) | 1 clock (2-3 overlappable) | --
Scalar PPro/PII/PIII | 0.5 clock | 1 clock (3 overlappable) | 39 clocks (2 overlappable)
MMX PPro/PII/PIII | 0.5 clock | 3 clocks (2 overlappable) | --
FPU PPro/PII/PIII | 3 clocks (2 overlappable) | 5 clocks (4 overlappable) | 38 clocks (1 overlappable)
(This data is from Agner Fog's incredible Pentium optimization tome.)
You get the idea: divides are very, very costly. Visual C++ can, in some cases, convert a constant integer divide into a constant multiply with rounding -- the integer equivalent of multiplying by the reciprocal -- but it will not be able to do this for variable divides. You can emulate this yourself by creating a table of reciprocals and rounding constants and doing the multiply/add manually, but this is best done in assembly language, since C does not provide fast access to 32x32 -> 64-bit multiplies. It is even better if you can convert the divide to a shift.
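Here is a minimal sketch of the reciprocal-table approach in C, assuming divisors from 1 to 255 (a variable pixel count, say); recip and fast_div are illustrative names, and a compiler of this era may not reduce the 64-bit multiply to a single mul, which is why the text recommends assembly:

#include <stdint.h>

static uint32_t recip[256];

static void init_recip(void) {
    for(uint32_t d=1; d<256; d++)
        recip[d] = (uint32_t)((0xFFFFFFFFull + d) / d);   /* ceil(2^32 / d) */
}

/* floor(n/d), exact whenever d*(n+d) <= 2^32 -- which comfortably
   covers sums of a few hundred 8-bit pixels.  Add d/2 to n first
   if you want round-to-nearest. */
static inline uint32_t fast_div(uint32_t n, uint32_t d) {
    return (uint32_t)(((uint64_t)n * recip[d]) >> 32);    /* 32x32 -> 64 mul */
}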
Similarly, you should avoid square roots in inner loops, and especially any complex or transcendental functions such as hypot or cos. Note that you should only worry if you are doing a lot of these operations; if you are doing two divides per frame, don't worry about it.
Pixel accumulation
A lot of image processing algorithms require you to accumulate pixel sums. The most straightforward way is to use separate integer accumulators for each channel, but you can gain some speed by accumulating channels in parallel: accumulate red and blue together, and don't bother shifting green down:
rb_sum += (pixel & 0xff00ff);
g_sum += (pixel & 0xff00);
This allows you to accumulate up to 257 unscaled pixels before the blue sum can overflow into the red channel. Pixels can be prescaled by regular integer multiplication, and if you restrict yourself to 256 unscaled pixels, you can round the result as well:
unsigned rb_sum = 0x040004;   /* rounding: +4 in the red and blue fields */
unsigned g_sum  = 0x0400;     /* rounding: +4 in the green field */
for(int x=0; x<8; x++) {
    Pixel32 p = src[x];
    rb_sum += p & 0xff00ff;
    g_sum  += p & 0x00ff00;
}
/* mask off the fraction bits so the shift can't smear channels, then /8 */
*dst++ = ((rb_sum & 0x07f807f8) + (g_sum & 0x0007f800)) >> 3;
This trick is moot, of course, for MMX, where you basically get 16-bit channel accumulation simply by unpacking to words with the punpcklbw instruction.
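In C, the MMX version can be sketched with intrinsics; mmintrin.h is the standard MMX intrinsics header, sum_channels is an illustrative name, and the caller still has to issue emms (_mm_empty()) before any floating-point code runs:

#include <mmintrin.h>    /* MMX intrinsics */

typedef unsigned int Pixel32;    /* stand-in for the SDK's 32-bit pixel */

/* Accumulate the channels of n pixels (n <= 257) as four 16-bit words:
   b, g, r, a from low word to high. */
static __m64 sum_channels(const Pixel32 *src, int n) {
    __m64 zero = _mm_setzero_si64();
    __m64 acc  = _mm_setzero_si64();
    for(int x=0; x<n; x++) {
        __m64 p = _mm_cvtsi32_si64((int)src[x]);             /* movd */
        acc = _mm_add_pi16(acc, _mm_unpacklo_pi8(p, zero));  /* punpcklbw + paddw */
    }
    return acc;
}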
If you have to accumulate the alpha channel as well, you will have to sacrifice some speed to shift the alpha and green channels down. Also, this trick won't work if your scaling coefficients don't add up to a power of two, because you can't shift to do the division; in this case, you'll have to split the rb_sum result into its two channels before doing the non-power-of-two divide.
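For instance, with a hypothetical six-pixel window -- rounding constants left out of the accumulators, since the prescale-rounding trick above only works for power-of-two totals -- the finish could look like:

/* split the packed sums apart, then do the non-power-of-two divide */
unsigned b = ((rb_sum & 0xffff) + 3) / 6;   /* +3 rounds to nearest */
unsigned r = ((rb_sum >> 16)    + 3) / 6;
unsigned g = ((g_sum  >>  8)    + 3) / 6;
*dst++ = (r << 16) | (g << 8) | b;

This per-channel divide is exactly the case the reciprocal table from the division section is meant to speed up.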
Kicking out the loop counter
Loop counters are expensive in scalar code because of the Intel register shortage, so you should always
look for ways to eliminate one or more counting variables. The first way is to use the destination pointer as the
counter:
Pixel32 *dst_limit = dst + w;
do {
    ...
} while(++dst < dst_limit);
If you have multiple pointers which are walking the same amounts, you can also convert some into pointer offsets:
ptrdiff_t srcoffset = src - dst;
Pixel32 *dst_limit = dst + w;
do {
    Pixel32 p = *(dst + srcoffset);   /* reads from src as dst advances */
    ...
    *dst++ = p;
} while(dst < dst_limit);
You can also access memory by constant offsets from the counter:
int x = -w;
dst += w;
do {
    ...
    dst[x] = result;   /* dst stays fixed at the end of the row */
} while(++x);          /* x runs from -w up to 0 */
Averaging pixels
The most straightforward way to average pixels is to separate the pixels into channels, average all the channels
together, and refold the channels. This is sloooowwww. You can get better performance by averaging
all channels in parallel:
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + (p1&p2&0x01010101); /* round down */
pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f) + ((p1|p2)&0x01010101); /* round up */
You'll notice that p1 and p2 have opposite shift-mask orders. This is done to help the compiler reduce pressure on the shifting unit; there are two ALUs in the CPU but only one shift unit.
Don't drop the rounding term! If you do, you'll end up with code like this:

pf = ((p1&0xfefefefe)>>1) + ((p2>>1)&0x7f7f7f7f);

Your average will be biased toward zero, and more importantly, it will be impossible to achieve an FF result.
Alignment
VirtualDub guarantees that all input and output scanlines are aligned to 4-byte boundaries, but not to 8-byte
boundaries. You need to be careful about this if you are using MMX code, since 64-bit MMX memory accesses
can incur misalignment penalties. Therefore, for best results, you should replace a move-unpack instruction pair:

movq mm0,[esi]        ;64-bit load -- may cross an 8-byte boundary
pxor mm7,mm7
movq mm1,mm0
punpcklbw mm0,mm7     ;low four bytes -> words
punpckhbw mm1,mm7     ;high four bytes -> words

with a dual load pair:

movd mm0,[esi]        ;two 32-bit loads are always aligned here
pxor mm7,mm7
movd mm1,[esi+4]
punpcklbw mm0,mm7
punpcklbw mm1,mm7

Unpacked loads can be changed from:

movq mm0,[esi]

to:

movd mm0,[esi]
punpckldq mm0,[esi+4]

Intel and AMD CPUs only load dwords on a punpckldq instruction, so this is safe. Note that this is not true for the punpckhdq instruction.